
Kaggle Series (3): Rental Listing Inquiries (Part 2): XGBoost

In the previous post we did an initial exploration of the dataset and visualized it, giving us a basic understanding of the data. With that exploratory work as a foundation, we now have a set of base features; combined with the target variable, we can start training a model. We use cross-validation to judge results offline: the training data is split into two parts, one used to train the classifier and the other used as a validation set on which we compute the loss and evaluate how good the model is.
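
As a minimal illustration of this offline evaluation idea, here is a hedged sketch using scikit-learn on synthetic data; X, y and the LogisticRegression model are placeholders, not this competition's code:

# Hold out part of the training data as a validation set and score it with log loss.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_classes=3, n_informative=5, random_state=0)
# Hold out 20% of the data as a validation set
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
# The validation log loss estimates how the model will behave on unseen data
print(log_loss(y_val, clf.predict_proba(X_val)))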

In Kaggle's Higgs Boson Machine Learning Challenge, XGBoost drew wide attention on the competition forum for its outstanding efficiency and high predictive accuracy, holding its own among more than 1,700 competing teams. As its reputation in the Kaggle community grew, teams have recently won first place with its help. Because it performs well and its computational cost is modest, it is also widely used in industry.

Today we will train an XGBoost base model. For a refresher on how XGBoost works, see Machine Learning Algorithm Series (8): XGBoost.

1. Preparation

First, import the packages we need:

import os
import sys
import operator
import numpy as np
import pandas as pd
from scipy import sparse
import xgboost as xgb
from sklearn import model_selection,preprocessing,ensemble
from sklearn.metrics import log_loss
from sklearn.feature_extraction.text import TfidfVectorizer,CountVectorizer

The purpose of some of these packages will be explained later, when they are actually used.

Load the data:

data_path = '../data/'
train_file = data_path + "train.json"
test_file = data_path +"test.json"
train_df = pd.read_json(train_file)
test_df = pd.read_json(test_file)
print(train_df.shape)  # (49352, 15)
print(test_df.shape)   # (74659, 14)

Take a look at the first two rows:

train_df.head(2)


2. Feature Construction

The numerical features need no preprocessing, so we start by putting them into a list, features_to_use:

features_to_use = ["bathrooms","bedrooms","latitude","longitude","price"]

Now let's build some new features from the existing ones:

# Number of photos (num_photos)
train_df['num_photos'] = train_df['photos'].apply(len)
test_df['num_photos'] = test_df['photos'].apply(len)
# Number of listed features
train_df['num_features'] = train_df['features'].apply(len)
test_df['num_features'] = test_df['features'].apply(len)
# Number of words in the description
train_df['num_description_words'] = train_df['description'].apply(lambda x: len(x.split(" ")))
test_df['num_description_words'] = test_df['description'].apply(lambda x: len(x.split(" ")))
# Parse the creation time so we can break it into several features
train_df['created'] = pd.to_datetime(train_df['created'])
test_df['created'] = pd.to_datetime(test_df['created'])
# Extract year, month, day and hour from the creation time
# Year
train_df['created_year'] = train_df['created'].dt.year
test_df['created_year'] = test_df['created'].dt.year
# Month
train_df['created_month'] = train_df['created'].dt.month
test_df['created_month'] = test_df['created'].dt.month
# Day
train_df['created_day'] = train_df['created'].dt.day
test_df['created_day'] = test_df['created'].dt.day
# Hour
train_df['created_hour'] = train_df['created'].dt.hour
test_df['created_hour'] = test_df['created'].dt.hour
# Add all of these to the feature list (created above, already holding the numerical features)
features_to_use.extend(["num_photos", "num_features", "num_description_words", "created_year", "created_month", "created_day", "created_hour", "listing_id"])

We have four categorical features:

  • display_address
  • manager_id
  • building_id
  • street_address

We can label-encode each of them:

categorical = ["display_address", "manager_id", "building_id", "street_address"]
for f in categorical:
    if train_df[f].dtype == 'object':
        lbl = preprocessing.LabelEncoder()
        lbl.fit(list(train_df[f].values) + list(test_df[f].values))
        train_df[f] = lbl.transform(list(train_df[f].values))
        test_df[f] = lbl.transform(list(test_df[f].values))
        features_to_use.append(f)

There are also some string-valued features; let's first join them into a single string per listing:

train_df["features"] = train_df["features"].apply(lambda x:" ".join(["_".join(i.split(" "))for i in x]))
print train_df['features'].head(2)
test_df['features'] = test_df["features"].apply(lambda x: " ".join(["_".join(i.split(" "))for i in x]))
print test_df['features'].head(2)

The resulting strings look like this:

10000 Doorman Elevator Fitness_Center Cats_Allowed D…
100004 Laundry_In_Building Dishwasher Hardwood_Floors…

Next we use the CountVectorizer class to turn these feature strings into a sparse matrix of term counts (note that despite the variable name tfidf below, CountVectorizer produces raw counts rather than TF-IDF weights):

tfidf = CountVectorizer(stop_words ="english",max_features=200)
tr_sparse = tfidf.fit_transform(train_df["features"])
te_sparse = tfidf.transform(test_df["features"])
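
If you wanted genuine TF-IDF weights instead of raw counts, TfidfVectorizer (already imported above) follows the same fit/transform pattern. This is only a hedged sketch of the swap, not the code used for the results below, and whether it actually helps should be checked by cross-validation:

# Possible alternative: TF-IDF weights instead of raw term counts.
# Fit on the training set only, then transform both train and test.
tfidf_vec = TfidfVectorizer(stop_words="english", max_features=200)
tr_sparse_tfidf = tfidf_vec.fit_transform(train_df["features"])
te_sparse_tfidf = tfidf_vec.transform(test_df["features"])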

One point worth emphasizing: whenever we transform features, the same transformation must be applied to both the training set and the test set. Now we stack all of the processed features together (horizontal concatenation):

train_X = sparse.hstack([train_df[features_to_use],tr_sparse]).tocsr()
test_X = sparse.hstack([test_df[features_to_use],te_sparse]).tocsr()

Then map the target variable to 0, 1 and 2, as follows:

target_num_map = {'high': 0, 'medium': 1, 'low': 2}
train_y = np.array(train_df['interest_level'].apply(lambda x: target_num_map[x]))
print(train_X.shape, test_X.shape)
# (49352, 217) (74659, 217)

As you can see, after all of the feature construction above, the feature count has reached 217 (17 dense features plus the 200 count features from CountVectorizer).

Now we are ready to build the model.

3. XGBoost Modeling

First, write a general-purpose XGBoost training function:

def runXGB(train_X, train_y, test_X, test_y=None, feature_names=None, seed_val=0, num_rounds=1000):
    # Parameter settings
    param = {}
    param['objective'] = 'multi:softprob'  # multi-class classification, output class probabilities
    param['eta'] = 0.1  # learning rate
    param['max_depth'] = 6  # maximum tree depth; larger values overfit more easily
    param['silent'] = 1  # suppress training messages
    param['num_class'] = 3  # three classes
    param['eval_metric'] = "mlogloss"  # multi-class log loss
    param['min_child_weight'] = 1  # minimum sum of second-order gradients in a leaf; smaller values overfit more easily, and this parameter strongly affects results
    param['subsample'] = 0.7  # row subsampling of the training instances
    param['colsample_bytree'] = 0.7  # column subsampling when building each tree
    param['seed'] = seed_val  # random seed
    num_rounds = num_rounds  # number of boosting rounds
    plst = list(param.items())
    xgtrain = xgb.DMatrix(train_X, label=train_y)
    if test_y is not None:
        xgtest = xgb.DMatrix(test_X, label=test_y)
        watchlist = [(xgtrain, 'train'), (xgtest, 'test')]
        model = xgb.train(plst, xgtrain, num_rounds, watchlist, early_stopping_rounds=20)
        # When the number of rounds is large, early_stopping_rounds stops training once the
        # evaluation metric has not improved within the given number of rounds
    else:
        xgtest = xgb.DMatrix(test_X)
        model = xgb.train(plst, xgtrain, num_rounds)
    pred_test_y = model.predict(xgtest)
    return pred_test_y, model

The function returns the predictions and the trained model.

5-fold cross-validation splits the training set into five parts, each fold in turn serving as the validation set:

cv_scores = []
kf = model_selection.KFold(n_splits=5, shuffle=True, random_state=2016)
for dev_index, val_index in kf.split(range(train_X.shape[0])):
    dev_X, val_X = train_X[dev_index, :], train_X[val_index, :]
    dev_y, val_y = train_y[dev_index], train_y[val_index]
    preds, model = runXGB(dev_X, dev_y, val_X, val_y)
    cv_scores.append(log_loss(val_y, preds))
    print(cv_scores)
    break  # only run the first fold here; remove the break to evaluate all five folds

The output looks like this:

[0] train-mlogloss:1.04135 test-mlogloss:1.04229
Multiple eval metrics have been passed: 'test-mlogloss' will be used for early stopping.
Will train until test-mlogloss hasn't improved in 20 rounds.
[1] train-mlogloss:0.989004 test-mlogloss:0.99087
[2] train-mlogloss:0.944233 test-mlogloss:0.947047
[3] train-mlogloss:0.90536 test-mlogloss:0.908933
[4] train-mlogloss:0.872054 test-mlogloss:0.876526
[5] train-mlogloss:0.841783 test-mlogloss:0.847383
[6] train-mlogloss:0.815921 test-mlogloss:0.822307
[7] train-mlogloss:0.793337 test-mlogloss:0.800476
[8] train-mlogloss:0.773562 test-mlogloss:0.781413
[9] train-mlogloss:0.754927 test-mlogloss:0.76381
[10] train-mlogloss:0.738299 test-mlogloss:0.747959
······
······
[367] train-mlogloss:0.348196 test-mlogloss:0.548011
[368] train-mlogloss:0.347768 test-mlogloss:0.547992
[369] train-mlogloss:0.347303 test-mlogloss:0.548021
[370] train-mlogloss:0.346807 test-mlogloss:0.548065
[371] train-mlogloss:0.346514 test-mlogloss:0.548079
[372] train-mlogloss:0.34615 test-mlogloss:0.548097
[373] train-mlogloss:0.345859 test-mlogloss:0.548111
[374] train-mlogloss:0.345377 test-mlogloss:0.548081
[375] train-mlogloss:0.344961 test-mlogloss:0.548068
[376] train-mlogloss:0.344493 test-mlogloss:0.548024
[377] train-mlogloss:0.344086 test-mlogloss:0.547975
Stopping. Best iteration:
[357] train-mlogloss:0.352182 test-mlogloss:0.547867

After 357 rounds, the log loss is 0.352182 on the training fold and 0.547867 on the validation fold.

Then we train on the full training set and predict on the test set, using 400 rounds (close to the best iteration found above):

preds,model=runXGB(train_X,train_y,test_X,num_rounds=400)

Write the results to a CSV file in the format required by the competition:

out_df = pd.DataFrame(preds)
out_df.columns = ["high", "medium", "low"]
out_df["listing_id"] = test_df.listing_id.values
out_df.to_csv("xgb_starter2.csv", index=False)

Take a look at the final output, then submit it to Kaggle. With that, the whole modeling process is complete.

In the next two posts, we will focus on XGBoost parameter-tuning experience and on computing TF-IDF with scikit-learn.


